    A new method for interoperability between lexical resources using MDA approach

    Lexical resources are increasingly multiplatform due to the diverse needs of linguists. Merging, comparing, finding correspondences and deducing differences between these lexical resources remain difficult tasks, so interoperability between them is hard, or even impossible, to achieve. In this context, we establish a new method based on the MDA approach to resolve interoperability between lexical resources. The proposed method consists in building a common structure (an OWL-DL ontology) for the resources involved. This common structure lets the involved resources communicate, so that we can create a complex grid between them allowing transformation from one format to another. We tested the resulting method on an LMF lexicon.
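
    A minimal sketch of this pivot idea, assuming rdflib and an invented namespace: the common structure is declared as a tiny OWL-DL ontology, and entries from any source lexicon are projected into it as instances. The class and property names (LexicalEntry, hasLemma) are illustrative, not the ontology built in the paper.

```python
# A minimal sketch, assuming rdflib; names below are illustrative.
from rdflib import Graph, Literal, Namespace, RDF, RDFS
from rdflib.namespace import OWL

PIVOT = Namespace("http://example.org/pivot#")  # hypothetical namespace

def build_pivot_ontology() -> Graph:
    """Declare a tiny OWL-DL common structure shared by all source formats."""
    g = Graph()
    g.bind("pivot", PIVOT)
    g.add((PIVOT.LexicalEntry, RDF.type, OWL.Class))
    g.add((PIVOT.hasLemma, RDF.type, OWL.DatatypeProperty))
    g.add((PIVOT.hasLemma, RDFS.domain, PIVOT.LexicalEntry))
    return g

def project_entry(g: Graph, entry_id: str, lemma: str) -> None:
    """Map one entry from a source lexicon into the common structure."""
    node = PIVOT[entry_id]
    g.add((node, RDF.type, PIVOT.LexicalEntry))
    g.add((node, PIVOT.hasLemma, Literal(lemma, lang="ar")))

g = build_pivot_ontology()
project_entry(g, "entry_001", "كتاب")   # "kitab" (book)
print(g.serialize(format="turtle"))
```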

    A prototype for projecting HPSG syntactic lexica towards LMF

    The comparative evaluation of Arabic HPSG grammar lexica requires a deep study of their linguistic coverage. The complexity of this task results mainly from the heterogeneity of the descriptive components within those lexica (underlying linguistic resources and different data categories, for example). It is therefore essential to define more homogeneous representations, which in turn will enable us to compare the lexica and eventually merge them. In this context, we present a method for comparing HPSG lexica based on a rule system. This method is implemented within a prototype for the projection from Arabic HPSG to a normalised pivot language compliant with LMF (ISO 24613, Lexical Markup Framework) and serialised using a TEI (Text Encoding Initiative) based representation. The design of this system is based on an initial study of the HPSG formalism, examining its adequacy for the representation of Arabic; from this, we identify the feature structures corresponding to each Arabic lexical category and their possible LMF counterparts.
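
    The sketch below illustrates what such a rule-based projection could look like, assuming a dictionary of projection rules and TEI dictionary elements (entry, form, gramGrp, gram); the rule table and feature names are invented for the example, not the prototype's actual rule system.

```python
# Illustrative rule-based projection; the rule table and TEI layout are
# assumptions for the sketch, not the prototype's actual rule system.
import xml.etree.ElementTree as ET

# Each rule maps an HPSG (feature, value) pair to an LMF data category.
PROJECTION_RULES = {
    ("HEAD", "noun"): ("partOfSpeech", "noun"),
    ("HEAD", "verb"): ("partOfSpeech", "verb"),
    ("AGR", "fem"): ("grammaticalGender", "feminine"),
}

def project(hpsg_entry: dict, lemma: str) -> ET.Element:
    """Build a TEI <entry> carrying LMF-style grammatical features."""
    entry = ET.Element("entry")
    form = ET.SubElement(entry, "form", type="lemma")
    ET.SubElement(form, "orth").text = lemma
    gram_grp = ET.SubElement(entry, "gramGrp")
    for feature, value in hpsg_entry.items():
        rule = PROJECTION_RULES.get((feature, value))
        if rule:
            category, normalized = rule
            ET.SubElement(gram_grp, "gram", type=category).text = normalized
    return entry

tei = project({"HEAD": "noun", "AGR": "fem"}, "مدرسة")   # "madrasa" (school)
print(ET.tostring(tei, encoding="unicode"))
```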

    Segmentation tool for hadith corpus to generate TEI encoding

    A segmentation tool for a hadith corpus is necessary to prepare the TEI hadith encoding process. In this context, we aim to develop a tool for segmenting hadith texts from the Sahih al-Bukhari corpus. To achieve this objective, we start by identifying the different hadith structures. Then, we elaborate an automatic processing tool for hadith segmentation. This tool will be integrated into a prototype supporting the TEI encoding process. The experimentation and evaluation of this tool are based on the Sahih al-Bukhari corpus. The obtained results were encouraging despite some flaws related to exceptional cases of hadith structure.
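
    A minimal sketch of marker-based segmentation, assuming the conventional isnad/matn division of a hadith and a heuristic split at the last reported-speech marker; the marker list and the heuristic are illustrative, not the tool's actual rules.

```python
# A minimal sketch, assuming an isnad/matn split at the last occurrence of
# the reported-speech marker; markers and heuristic are illustrative only.
import re

# Transmission verbs that typically open an isnad (chain of narrators).
ISNAD_MARKERS = re.compile(r"^(حدثنا|أخبرنا|حدثني)")

def segment_hadith(text: str) -> dict:
    """Split a hadith into isnad (chain) and matn (body) at the last
    reported-speech marker 'قال' (qala, "he said")."""
    text = text.strip()
    last_qala = text.rfind("قال")
    if not ISNAD_MARKERS.match(text) or last_qala == -1:
        return {"isnad": "", "matn": text}   # no chain detected
    return {"isnad": text[:last_qala].strip(),
            "matn": text[last_qala:].strip()}

# Simplified, unvocalized opening hadith of Sahih al-Bukhari.
print(segment_hadith("حدثنا الحميدي قال حدثنا سفيان قال إنما الأعمال بالنيات"))
```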

    A standard TMF modeling for Arabic patents

    Patent applications are similarly structured worldwide. They consist of a cover page, a specification, claims, drawings (if necessary) and an abstract. In addition to their content (text, numbers and citations), all patent publications contain a relatively rich set of well-defined metadata. In the Arab world, there is no North African or Arabian Intellectual Property Office and therefore no uniform collection of Arabic patents. In Tunisia, for example, there is no digital collection of patent documents and therefore no XML collections. In this context, we aim to create a TMF standardized model for scientific patents and to develop a generator of XML patent collections with a uniform, easy-to-use structure. To test our approach, we will use a collection of XML scientific patent documents in three languages (Arabic, French, and English).
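
    The snippet below sketches the kind of uniform XML record such a generator could emit for one patent; the element names (patent, coverPage, claims) are hypothetical placeholders, not the TMF model proposed in the paper.

```python
# Sketch of a uniform XML patent record; element names are hypothetical
# placeholders, not the TMF model proposed in the paper.
import xml.etree.ElementTree as ET

def build_patent_record(meta: dict, claims: list, abstract: str) -> ET.Element:
    """Assemble one patent document: cover-page metadata, claims, abstract."""
    patent = ET.Element("patent", lang=meta.get("lang", "ar"))
    cover = ET.SubElement(patent, "coverPage")
    for field in ("applicationNumber", "title", "applicant", "filingDate"):
        ET.SubElement(cover, field).text = meta.get(field, "")
    claims_el = ET.SubElement(patent, "claims")
    for i, claim in enumerate(claims, start=1):
        ET.SubElement(claims_el, "claim", n=str(i)).text = claim
    ET.SubElement(patent, "abstract").text = abstract
    return patent

record = build_patent_record(
    {"applicationNumber": "TN-0001", "title": "Example title", "lang": "ar"},
    ["A device comprising ..."],
    "A short abstract of the invention.",
)
print(ET.tostring(record, encoding="unicode"))
```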

    Automatic construction of a TMF Terminological Database using a transducer cascade

    The automatic development of terminological databases, especially in a standardized format, is crucial for multiple applications related to technical and scientific knowledge that require semantic and terminological descriptions covering multiple domains. In this context, we face two challenges: the first is the automatic extraction of terms in order to build a terminological database; the second is their normalization into a standardized format. To deal with these challenges, we propose an approach based on a cascade of transducers built with the CasSys tool of the Unitex platform, which benefits both from the success of the rule-based approach to term extraction and from the suitability of the TMF standard for the representation of terms. We tested and evaluated our approach on Arabic scientific and technical documents from the elevator domain, and the results are very encouraging.
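
    The toy cascade below imitates the CasSys principle in plain Python: each stage is a transducer-like pass whose output feeds the next stage, ending with a TMF-style serialisation. The English patterns stand in for the Arabic Unitex graphs and are purely illustrative.

```python
# Toy cascade in the spirit of CasSys; patterns are illustrative stand-ins.
import re

def stage_mark_candidates(text: str) -> str:
    """Stage 1: bracket noun-phrase term candidates (toy patterns)."""
    return re.sub(r"\b(safety (?:gear|brake)|machine room|door)\b",
                  r"<term>\1</term>", text)

def stage_filter_short(text: str) -> str:
    """Stage 2: unwrap single-word candidates kept by stage 1."""
    return re.sub(r"<term>(\w+)</term>", r"\1", text)

def stage_to_tmf(text: str) -> str:
    """Stage 3: serialise the surviving candidates as TMF-style entries."""
    terms = re.findall(r"<term>(.+?)</term>", text)
    return "\n".join(
        f'<termEntry><langSet xml:lang="en"><tig><term>{t}</term></tig>'
        f"</langSet></termEntry>" for t in terms
    )

def run_cascade(text: str) -> str:
    for stage in (stage_mark_candidates, stage_filter_short, stage_to_tmf):
        text = stage(text)        # each transducer feeds the next one
    return text

print(run_cascade("The safety gear stops the cab when the door opens."))
```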

    Encoding prototype of Al-Hadith Al-Shareef in TEI

    The standardization of Al-Hadith Al-Shareef can guarantee interoperability and interchangeability with other textual sources and takes the processing of Hadith corpora to a higher level. Still, research on Hadith corpora has not previously treated standardization as a real objective, especially with respect to standards such as TEI (Text Encoding Initiative). In this context, we aim at the standardization of Al-Hadith Al-Shareef on the basis of the TEI guidelines. To achieve this objective, we elaborated a TEI model customized for the Hadith structure. We then developed a prototype for encoding Hadith texts: it analyses them and automatically generates a standardized TEI version of each Hadith. The evaluation of the TEI model and the prototype is based on a Hadith corpus collected from Sahih al-Bukhari. The obtained results were encouraging despite some flaws related to exceptional cases of Hadith structure.
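
    As a rough sketch of the prototype's output, the snippet below wraps one segmented hadith in a TEI div; the type values ("hadith", "isnad", "matn") are hypothetical customizations, not necessarily those of the paper's TEI model.

```python
# Rough sketch of TEI output for one hadith; the div/seg type values are
# hypothetical customizations, not necessarily the paper's TEI model.
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

def encode_hadith(number: int, isnad: str, matn: str) -> ET.Element:
    """Wrap one hadith in a TEI <div> with separate isnad and matn segments."""
    ET.register_namespace("", TEI_NS)
    div = ET.Element(f"{{{TEI_NS}}}div", {"type": "hadith", "n": str(number)})
    seg_isnad = ET.SubElement(div, f"{{{TEI_NS}}}seg", {"type": "isnad"})
    seg_isnad.text = isnad
    seg_matn = ET.SubElement(div, f"{{{TEI_NS}}}seg", {"type": "matn"})
    seg_matn.text = matn
    return div

div = encode_hadith(1, "حدثنا الحميدي عن سفيان", "إنما الأعمال بالنيات")
print(ET.tostring(div, encoding="unicode"))
```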

    Towards modeling Arabic lexicons compliant LMF in OWL-DL

    Elaborating reusable lexical databases and, especially, making interoperability operational are crucial tasks affecting both Natural Language Processing (NLP) and the Semantic Web. In this respect, we consider that modeling the Lexical Markup Framework (LMF) in the Web Ontology Language Description Logics (OWL-DL) can be a beneficial step towards these aims. This proposal has broad relevance since it concerns LMF, the reference standard for modeling lexical structures. In this paper, we study the requirements of this suggestion. We first give a quick presentation of the LMF framework. Next, we describe the three ontology definition sublanguages aimed at specific classes of users: OWL Lite, OWL-DL and OWL Full. After comparing the three, we choose to work with OWL-DL. We then describe the steps needed to model LMF in OWL. Finally, we apply this model to develop an instance for an Arabic lexicon.
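
    A small sketch of the mapping idea, assuming rdflib: LMF core classes become OWL-DL classes, the meta-model's aggregation links become properties, and one Arabic entry is instantiated. The URIs and names are illustrative, not the model built in the paper.

```python
# Sketch of the LMF-to-OWL-DL mapping idea, assuming rdflib; URIs and
# property names are illustrative, not the model built in the paper.
from rdflib import Graph, Literal, Namespace, RDF, RDFS
from rdflib.namespace import OWL

LMF = Namespace("http://example.org/lmf#")   # hypothetical namespace
g = Graph()
g.bind("lmf", LMF)

# LMF core meta-model classes become OWL-DL classes.
for cls in ("LexicalEntry", "Lemma", "Sense"):
    g.add((LMF[cls], RDF.type, OWL.Class))

# Aggregation links of the meta-model become object/datatype properties.
g.add((LMF.hasLemma, RDF.type, OWL.ObjectProperty))
g.add((LMF.hasLemma, RDFS.domain, LMF.LexicalEntry))
g.add((LMF.hasLemma, RDFS.range, LMF.Lemma))
g.add((LMF.writtenForm, RDF.type, OWL.DatatypeProperty))

# One instance from a hypothetical Arabic lexicon.
g.add((LMF.entry_kataba, RDF.type, LMF.LexicalEntry))
g.add((LMF.entry_kataba, LMF.hasLemma, LMF.lemma_kataba))
g.add((LMF.lemma_kataba, RDF.type, LMF.Lemma))
g.add((LMF.lemma_kataba, LMF.writtenForm, Literal("كتب", lang="ar")))

print(g.serialize(format="turtle"))
```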

    Toward the Resolution of Arabic Lexical Ambiguities with Transduction on Text Automaton

    Lexical analysis can be a way to remove ambiguity from Arabic text, and the resolution of ambiguity is an important task in several Natural Language Processing (NLP) applications. Our proposed resolution method is essentially based on the use of transducers applied to text automata. These transducers specify lexical and contextual rules for Arabic and thereby allow the resolution of lexical ambiguities. Different types of lexical ambiguities are identified and studied in order to extract an appropriate set of rules. We then describe the lexical rules in the ELAG system (Elimination of Lexical Ambiguities by Grammars), which can delete the paths representing morphosyntactic ambiguities. In addition, we present an experiment carried out on the Unitex platform with various linguistic resources to obtain disambiguated syntactic structures suitable for parsing. The obtained results are promising and can be improved by adding further rules and heuristics.
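
    The toy example below illustrates the principle of pruning a text automaton: the automaton is represented as a lattice of alternative morphological analyses per token, and a contextual rule removes inconsistent paths. The rule is invented for the example; real ELAG grammars are considerably richer.

```python
# Toy ELAG-style pruning over a text automaton, represented here as a
# lattice of alternative analyses per token; the rule below is invented.
from itertools import product

# "ذهب الولد": dhahaba (went, VERB) vs. dhahab (gold, NOUN) + al-walad (boy).
lattice = [
    [("ذهب", "VERB"), ("ذهب", "NOUN")],
    [("الولد", "NOUN")],
]

def prefer_verb_reading(path) -> bool:
    """Invented contextual rule: reject the NOUN+NOUN path in this context."""
    (_, tag1), (_, tag2) = path
    return not (tag1 == "NOUN" and tag2 == "NOUN")

# Keep only the paths of the automaton that satisfy every rule.
paths = [p for p in product(*lattice) if prefer_verb_reading(p)]
print(paths)   # the ambiguous NOUN+NOUN path has been deleted
```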

    An Arabic Probabilistic Parser based on a Property Grammar

    The specificities of Arabic parsing, such as agglutination, vocalization and the relatively free word order of Arabic sentences, remain a major issue to consider. To promote robustness, such a parser should take into account different types of constraints. The Property Grammar (PG) formalism verifies the satisfiability of constraints directly on the units of the structure, thanks to its properties (or relations). In this context, we propose to build a probabilistic parser with syntactic properties using a PG, and we weight the production rules using different kinds of implicit information, in particular the syntactic properties. We experimented with our parser on the ATB treebank using the CYK parsing algorithm and obtained encouraging results. Our method also automates the implementation of most property types. Its generalization to other languages or corpus domains (using treebanks) is a promising perspective, and its combination with pre-trained BERT models may also make our parser faster.
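
    A compact sketch of probabilistic CYK over a toy grammar in Chomsky normal form, with a hook where Property Grammar scores could weight each rule; the grammar, the probabilities and the property_score stub are illustrative, not the paper's trained model.

```python
# Probabilistic CYK sketch over a toy PCFG; grammar, probabilities and the
# property_score hook are illustrative, not the paper's trained model.
from collections import defaultdict

# Toy PCFG in Chomsky normal form: (lhs, rhs) -> probability.
binary = {("S", ("NP", "VP")): 1.0, ("VP", ("V", "NP")): 1.0}
lexical = {("NP", "الولد"): 0.5, ("NP", "الكتاب"): 0.5, ("V", "قرأ"): 1.0}

def property_score(lhs, span):
    """Hook for PG constraints (linearity, requirement, ...); neutral here."""
    return 1.0

def cyk(words):
    n = len(words)
    chart = defaultdict(dict)      # chart[(i, j)][symbol] = best probability
    for i, w in enumerate(words):
        for (sym, word), p in lexical.items():
            if word == w:
                chart[(i, i + 1)][sym] = p
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for (lhs, (b, c)), p in binary.items():
                    if b in chart[(i, k)] and c in chart[(k, j)]:
                        prob = p * chart[(i, k)][b] * chart[(k, j)][c]
                        prob *= property_score(lhs, (i, j))
                        if prob > chart[(i, j)].get(lhs, 0.0):
                            chart[(i, j)][lhs] = prob
    return chart[(0, n)].get("S", 0.0)

print(cyk(["الولد", "قرأ", "الكتاب"]))   # "the boy read the book" -> 0.25
```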